Abstract: A framework for unsupervised group activity analysis from a single video
is presented here. Our working hypothesis is that human actions lie on a union of
low-dimensional subspaces, and thus can be efficiently modeled as sparse linear
combinations of atoms from a learned dictionary representing the action's primitives.
Contrary to prior art, and with the primary goal of spatio-temporal action grouping,
in this work only a single video segment is available for both unsupervised learning
and analysis, without any prior training information. After extracting simple features at a single
spatio-temporal scale, we learn a dictionary for each individual in the video during 
each short time lapse. These dictionaries allow us to compare the individuals' actions 
by producing an affinity matrix which contains sufficient discriminative information 
about the actions in the scene leading to grouping with simple and efficient tools. With 
diverse publicly available real videos, we demonstrate the effectiveness of the proposed 
framework and its robustness to cluttered backgrounds, changes of human appearance, 
and action variability. 



1. Introduction 

The need for automatic and semi-automatic processing tools for video analysis is
constantly increasing. This is mostly due to the acquisition of large volumes of data that
need to be analyzed with very limited human intervention. In recent years, significant
research efforts have been dedicated to tackling this problem. In this work, we focus on
the analysis of human actions, and in particular the spatio-temporal grouping of
activities.

Our understanding of human actions and interactions makes us capable of identifying
and characterizing these over relatively short time intervals and in an almost effortless
fashion. Ideally, we would like to teach a computer system to do exactly this. However,
several challenges complicate the problem. Some come from the electro-optical system
acquiring the data (e.g., noise, jitter, scale variations, illumination changes, and
motion blur), but most come from the inherent complexity and variability of human
actions (e.g., shared body movements between actions, periodicity/aperiodicity of body
movements, global properties such as velocity, and local properties such as joint
dynamics).

With the availability of large amounts of training data, the above challenges are
alleviated to some extent. This is at the foundation of many classification methods that
rely on the redundancy of these large datasets, and on the generalization properties of
modern machine learning techniques, to properly model human actions. In supervised
human action classification, a template model for each class is learned from large
labeled datasets. Then, unlabeled actions are classified according to the class model
that best represents them. In this work, we focus on a very different problem: no
labeling is available, and all data have to be extracted from a single video. 1 A natural
question to ask here is: what can we do when only a single unlabeled video is available?
Given so little data, and no a priori information about the nature of the actions, we are
interested in human action grouping rather than action recognition.

Consider for example a camera observing a group of people waiting and moving
in line at an airport security checkpoint. We would like to automatically identify the
individuals performing anomalous (out of the norm) actions. We do not necessarily
know which action is normal and which is anomalous, but we are interested in knowing
when a "different from the group" action is occurring in a given time lapse, and in
being able to locate the corresponding individual (in space and time). Situations like
this occur not only in surveillance applications, but also in psychological studies (e.g.,
determining outlier autistic behavior in a children's classroom, or identifying group
leaders and followers), and in the sports and entertainment industry (e.g., identifying
the offensive and defensive teams, or identifying the lead singer in a concert).

We focus on modeling the general dynamics of individual actions in a single scene, 
with no a priori knowledge about the actual identity of these actions nor about the 
dynamics themselves. We propose an intuitive unsupervised action analysis framework 
based on sparse modeling for space-time analysis of motion imagery. The underlying
idea is that the activity dictionary learned for a given individual is also valid
for representing the same activity of other individuals, but not for those performing
different ones, nor for himself/herself after changing activity. We make the following main
contributions:

• Unsupervised action analysis: We extend the modeling of human actions to
a relatively unexplored area of interest. That is, we analyze unknown actions
from a group of individuals during consecutive short-time intervals, allowing for
action-based video summarization from a single video source.

• Solid performance using a simple feature: We use a simple feature descriptor 
based on absolute temporal gradients, which, in our setting, outperforms more 
sophisticated alternatives. 

• Sparse modeling provides sufficient discriminative information: We demon- 
strate that the proposed sparse modeling framework efficiently separates differ- 
ent actions and is robust to visual appearance even when using a single basic 
feature for characterization and simple classification rules. 

• Works on diverse data: We provide a simple working framework for studying
the dynamics of group activities, automatically detecting common actions,
changes of a person's action, and different activities within a group of individuals,
and test it on diverse data related to multiple applications.

1 Even if the video is long, we divide it into short-time intervals to alleviate the action mixing problem.
During each short-time interval, we have limited data available, and labels are never provided.

The remainder of the paper is structured as follows. In Section 2, we provide an 
overview of recently proposed methods for supervised and unsupervised action classifi- 
cation. Then, in Section 3, we give a detailed description of the proposed modeling and 
classification framework. We demonstrate the pertinence of our framework in Section 
4 with action grouping experiments in diverse (both in duration and content) videos. 
Finally, we provide concluding remarks and directions for future work in Section 5. 

2. Background and model overview 

In this section, we review recent techniques for human action classification which are 
related to the present work. We focus on feature extraction and modeling, and cover 
both supervised and unsupervised scenarios. 

2.1. Features for action classification 

Most of the recently proposed schemes for action classification in motion imagery 
are feature-based. In general, one could split the feature extraction process into an 
interest point detection phase and a descriptor encoding phase. Interest point detec- 
tion consists in finding spatio-temporal locations across the video where a pre-defined 
response function achieves local extrema, e.g., high spatio-temporal variations. Pop- 
ular detectors are the Spatio-temporal Interest Point detector (STIP) (Laptev, 2005), 
Cuboids (Dollar et al, 2005), and Hessian (Willems et al, 2008). Feature descriptors 
include the Cuboid feature (Dollar et al, 2005), Histogram of 3D Oriented Gradients 
(HOG3D) (Klaser et al, 2008), the combination of HOG and Histograms of Optical 
Flow (HOF) (Laptev et al, 2008), Local Motion Patterns (LMP) (Guha and Ward, 
2012), and Extended Speeded Up Robust Features (ESURF) (Willems et al, 2008), 
most of which are spatio-temporal extensions to techniques designed for still images. 

In practice, it is unclear which interest point detector and feature combination is
the most appropriate for modeling human actions. Wang et al (2009) performed an
exhaustive comparison of different detector/feature combinations on several datasets.
Although individual performance depends on the dataset, dense sampling (no interest
point detection) combined with HOG/HOF features seems to be a good choice in realistic
video settings, since it captures context from the scene background. Shao and Mattivi
(2010) performed a similar evaluation using a less realistic dataset (the KTH action
dataset 2) and concluded that the Cuboid detector combined with the Local Binary
Pattern on Three Orthogonal Planes (LBP-TOP) descriptor (Pietikäinen and Ojala, 1996;
Zhao and Pietikäinen, 2007) gives the best performance in terms of classification
accuracy.



2 http://www.nada.kth.se/cvap/actions/

Apart from these localized, low-level descriptors, there has been research focused
on the design of more global, high-level descriptors, which encode more semantic in- 
formation. For example, Sadanand and Corso (2012) proposed to represent the video 
as the collected output from a set of action detectors sampled in semantic and in view- 
point space. Then, these responses are max-pooled and concatenated to obtain semantic 
video representations. Gorelick et al (2005) proposed to model human actions as 3D 
shapes on a space-time volume. Then, the solution of the Poisson equation is used to 
extract different types of feature descriptors to discriminate between actions. Yilmaz 
and Shah (2005) proposed to use a similar space-time volume by calculating the point 
correspondences between consecutive frames, and action representative features were 
computed by differential geometry analysis of the space-time volume's surface. Other 
types of high-level features are the joint-keyed trajectories or human pose, which have 
been used for example by Ramanan and Forsyth (2003) and Basharat and Shah (2007). 
Kliper-Gross et al (2012) proposed to encode entire video clips as single vectors by us- 
ing Motion Interchange Patterns (MIP), which encode sequences by comparing patches 
between three consecutive frames and applying a series of processes for background 
suppression and video stabilization. 

Finally, it is clear that the choice of detectors and descriptors and their respective 
performances highly depend on the testing scenarios, including acquisition properties, 
dataset physical settings, and the modeling techniques. Most of these features work 
well in the context for which they were proposed, and changing the context might 
adversely affect their performance. Let us emphasize that feature design is not our main
goal in this paper. We next describe the feature extraction scheme used throughout
this work, which, although very simple, works well in all our scenarios, hence
highlighting the advantages of the proposed model.

The proposed feature. In order to properly capture the general spatio-temporal
characteristics of actions, it is always desirable to have a large number of training samples.
We aim at characterizing actions from scarce data and, under these conditions, we are
able to properly model the actions using a simple feature (the overall scheme is
illustrated in Fig. 1). We start by tracking and segmenting (Papadakis and Bugeau, 2011)
the individuals whose actions we analyze. These segmentation masks allow us to focus
mostly on the individuals while disregarding (most of) the background. We then set a
simple interest point detector based on the absolute temporal gradient. For each
individual, the points where the absolute temporal gradient is large enough (i.e., it exceeds
a pre-defined threshold) become interest points for training and modeling. The feature
is also very simple: it consists of a 3D (space and time) absolute temporal gradient
patch around each interest point. As we will illustrate in Section 4.1, this combination
works better than some more sophisticated alternatives in the literature.
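
As an illustration of the detector and feature just described, the following is a minimal Python sketch, not the authors' implementation. It assumes the temporal gradient is approximated by the difference between consecutive frames, that the (dilated) segmentation masks are already available, and that frames are plain nested lists; the names `interest_points`, `gradient_patch`, `half_xy` and `half_t` are hypothetical.

```python
def interest_points(frames, mask, threshold):
    """Return (t, y, x) points with a large absolute temporal gradient.

    frames: list of T grayscale frames, each a list of rows of numbers.
    mask:   list of T binary masks of the same shape (1 inside the dilated
            segmentation of the individual); assumed to be given.
    The temporal gradient is approximated by a frame difference (assumption).
    """
    points = []
    for t in range(1, len(frames)):
        for y, row in enumerate(frames[t]):
            for x, value in enumerate(row):
                grad = abs(value - frames[t - 1][y][x])
                if grad > threshold and mask[t][y][x]:
                    points.append((t, y, x))
    return points


def gradient_patch(frames, point, half_xy, half_t):
    """Flatten the absolute temporal gradient patch centred on `point`.

    Returns None when the patch would leave the video volume.
    """
    t0, y0, x0 = point
    height, width = len(frames[0]), len(frames[0][0])
    inside = (half_t + 1 <= t0 < len(frames) - half_t
              and half_xy <= y0 < height - half_xy
              and half_xy <= x0 < width - half_xy)
    if not inside:
        return None
    return [abs(frames[t][y][x] - frames[t - 1][y][x])
            for t in range(t0 - half_t, t0 + half_t + 1)
            for y in range(y0 - half_xy, y0 + half_xy + 1)
            for x in range(x0 - half_xy, x0 + half_xy + 1)]
```

With `half_xy = 7` and `half_t = 3`, `gradient_patch` would return vectors of dimension 15 x 15 x 7, the patch size reported in the experiments.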

2.2. Modeling actions for classification 

Once the feature samples are extracted from the video, an appropriate model is
necessary to obtain valid and discriminative action representations. Bag-of-words is one
of the most widely used models (Wang et al, 2009; Shao and Mattivi, 2010). It
basically consists of applying K-means clustering to the data to find K centroids, i.e.,
visual words, that are representative of all the training samples. Then, a video is
represented as a histogram of visual word occurrences, assigning one of the centroids to
each of the extracted features in the video using (most often) Euclidean distance. These
K centroids are found using a randomly selected subset of features coming from all the
training data. Then, a classifier like a Support Vector Machine (SVM) can be trained
from these histograms.

Fig 1: Scheme of the interest point extraction method. For extracting the interest points
for a given individual, we keep the points (1) whose temporal gradient exceeds a given
threshold and (2) that lie inside the individual's (dilated) segmentation mask. The
contrast of the temporal gradient is enhanced for improved visualization.
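
The bag-of-words encoding described above can be sketched as follows. This is a generic illustration in Python under the assumption that the K centroids have already been computed (for instance by K-means); the function name `bag_of_words_histogram` is hypothetical.

```python
def bag_of_words_histogram(features, centroids):
    """Count, for each visual word, how many features are closest to it.

    features:  list of feature vectors (lists of numbers).
    centroids: list of K centroid vectors, assumed to be precomputed.
    Each feature is hard-assigned to its nearest centroid (Euclidean distance).
    """
    counts = [0] * len(centroids)
    for feature in features:
        squared = [sum((a - b) ** 2 for a, b in zip(feature, centroid))
                   for centroid in centroids]
        counts[squared.index(min(squared))] += 1
    return counts
```

Each feature contributes to exactly one bin regardless of how far it lies from that centroid, which is the hard quantization that sparse modeling relaxes.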

Recently, sparse modeling has proven to be very successful in signal and image
processing applications, especially after highly efficient optimization methods and
supporting theoretical results emerged. The advantage of sparse modeling over
dictionaries fixed a priori, like Fourier or wavelet bases, is that it learns the dictionary from
the data themselves and can thus represent them more efficiently. Sparse modeling
represents each sample as a linear combination of a few atoms in the dictionary
(a subspace), as opposed to the single "atom" representation in standard bag-of-words.
This means that sparse modeling is more flexible but still keeps a rich representation
power, in contrast to bag-of-words representations, which are hard quantizations of
the data samples, regardless of their distance from the closest centroid. Sparse modeling
has been adapted to classification tasks like face recognition (Wright et al, 2008)
(without dictionary learning), digit and texture classification (Mairal et al, 2008),
hyperspectral imaging (Castrodad et al, 2011; Charles et al, 2011), and motion imagery
(Cadieu and Olshausen, 2008; Dean et al, 2009; Guo et al, 2010; Taylor et al, 2010),
among numerous other applications (note that techniques based on sparse modeling
have also performed very well in the PASCAL competition).

Several sparse modeling approaches have been recently proposed for action classi- 
fication tasks as well. Guha and Ward (2012) used learned dictionaries in three ways: 
individual dictionaries (one per action), a global (shared) dictionary, and a concatenated 
dictionary. Individual dictionaries are separately learned for each class of actions and 
unlabeled actions are assigned to the class whose dictionary gives the minimum recon- 
struction error. The concatenated dictionary is formed by concatenating all the individ- 
ual dictionaries, and unlabeled actions are assigned to the class whose corresponding 
subdictionary contributes the most to the reconstruction. To create the (shared) dictio- 
nary, a single common and unstructured dictionary is learned using all training feature 
data from every class. The dictionary coding coefficients of training actions are used 
to train a multi-class SVM. A shared (global) dictionary was also proposed by Dean
et al (2009), where a dictionary is learned in a recursive manner by first extracting
high response values coming from the Cuboids detector, and then using the resulting
sparse codes as the feature descriptors (PCA is optionally applied). Then, as often
done for classification, the method uses a bag-of-features approach for representing
the videos and a nonlinear $\chi^2$-SVM for classification. Guo et al (2010) built a
dictionary using vectorized log-covariance matrices of 12 hand-crafted features (mostly
derived from optical flow) obtained from entire labeled videos. Then, the vectorized
log-covariance matrix coming from an unlabeled video is represented with this
dictionary using $\ell_1$-minimization, and the video is classified by selecting the label associated
with those dictionary atoms that yield the minimum reconstruction error. Castrodad and
Sapiro (2012) proposed a two-level sparse modeling scheme, in order to capture shared
movements from different actions, achieving highly accurate classifications. Sparse
modeling has also been applied to abnormal event detection, which can be considered
as a binary classification problem. Cong et al (2011) proposed a model to detect
abnormal events by means of high reconstruction errors obtained by encoding using a
"normal events" dictionary, which is constructed using a collection of local spatio-temporal
patches from "normal events" sequences.

More recently, human action models have been extended to account for human in- 
teractions and group activities. Khamis et al (2012) introduced a model to classify 
nearby human activities by enforcing homogeneity on both the identity and the scene 
context on a frame by frame basis. Todorovic (2012) proposed to detect and localize 
individual, structured, and collective human activities (segmented as foreground) by 
using Kronecker (power) operations on learned activity graphs, and then classify these 
based on permutation-based graph matching. Fu et al (2012) proposed a model to la- 
bel group (social) activities using audio and video by learning latent variable spaces of 
user defined, class-conditional, and background attributes. Choi and Savarese (2012) 
proposed to track and estimate collective human activities by modeling label informa- 
tion at several levels of a hierarchy of activities going from individual to collective, and 
encoding their respective correlations. Our work is similar to these in the sense that we
seek to analyze group activities by exploiting the correlations of individuals' actions.
However, all of the above-mentioned schemes require a large amount of labeled
training data, which are not available for single video analysis. For this reason, we now turn
our attention to unsupervised approaches.

2.3. Unsupervised setting 

In multi-class supervised classification, labeled training samples from different classes
are required, and for anomalous event detection, "normal" training samples are needed.
The majority of the publicly available benchmarks for human action classification
contain only one person and one type of action per video, and are usually
accompanied by tracking bounding boxes (Hassner and Wolf, 2012). This is different
from our testing scenario, where only a single video (segment) is available, containing
more than one person, without other prior annotations.

Several works addressing unsupervised human action classification have been
proposed. Niebles et al (2008) used probabilistic Latent Semantic Analysis (pLSA) and
Latent Semantic Analysis (LSA) to first learn different classes of actions present in a
collection of unlabeled videos through bag-of-words, and then apply the learned model
to perform action categorization in new videos. This method is improved by using
spatio-temporal correlograms to encode long range temporal information into the
local features (Savarese et al, 2008). Willem et al (2009) proposed a bag-of-words
approach using Term Frequency-Inverse Document Frequency features and a data-stream
clustering procedure. A spatio-temporal link analysis technique combined with
spectral clustering to learn and classify the classes was proposed by Liu et al (2009). All
of these methods employ a bag-of-words approach with sophisticated features and
require training data to learn the action representations, while we work only on a single
video segment, with a much simpler feature. Our work also departs from correlation-based
video segmentation. These methods usually correlate a sample video clip with
a long video to find the similar segments in the target video (Shechtman and Irani,
2007), while our work treats all the actions in one video equally and automatically finds
the groups of the same action. The work presented here shares a similar (but broader)
goal with that of Zelnik-Manor and Irani (2006). Their unsupervised action grouping technique
works for a single video containing one individual, comparing the histograms of 3D
gradients computed throughout each short-time interval. We consider a more general
setting in which the video length ranges from one second to several minutes, and
which contains more than one individual, with individuals performing one or more actions
and not necessarily the same action all the time.

Bearing these differences in mind, we now proceed to describe the proposed model 
in detail. 

3. Unsupervised modeling of human actions 

In this work, we assume there are P > 1 individuals performing simultaneous actions
in the video. We first use an algorithm based on graph-cuts (Papadakis and Bugeau,
2011) to coarsely track and segment the individuals. These tracked segmentations are
used as masks from which the features for each individual are extracted. We later
show that this coarse masking procedure is sufficient for reliably grouping actions
with our method.

We first extract spatio-temporal patches from the absolute temporal gradient image,
around points which exceed a pre-defined temporal gradient threshold $\eta$. These
$m$-dimensional spatio-temporal patches from the $j$-th person are the data used to train
the corresponding dictionary $D^j$, $j = 1, 2, \ldots, P$. Let us denote by $n_j$ the number of
extracted patches from the $j$-th individual. More formally, we aim at learning a
dictionary $D^j \in \mathbb{R}^{m \times k_j}$ such that a training set of patches
$X^j = [x_1, \ldots, x_{n_j}] \in \mathbb{R}^{m \times n_j}$ can
be well represented by linearly combining a few of the basis vectors formed by the
columns of $D^j$, that is, $X^j \approx D^j A^j$. Each column of the matrix
$A^j \in \mathbb{R}^{k_j \times n_j}$ is the sparse
code corresponding to a patch from $X^j$. In this work we impose an additional
non-negativity constraint on the entries of $D^j$ and $A^j$. This problem can then be cast as
the optimization

$$\min_{(D^j, A^j) \succeq 0} \; \frac{1}{2} \| X^j - D^j A^j \|_F^2 + \lambda \| A^j \|_{1,1}, \qquad (1)$$

where $\succeq$ denotes the element-wise inequality, $\lambda$ is a positive constant controlling the
trade-off between reconstruction error and sparsity (numerous techniques exist for
setting this constant (e.g., Tibshirani, 1994)), $\| \cdot \|_{1,1}$ denotes the $\ell_1$ norm of a matrix,
that is, the sum of the absolute values of its coefficients, and $\| \cdot \|_F$ denotes the
Frobenius norm. Since Equation (1) is convex with respect to the variables $A^j$ when $D^j$
is fixed and vice versa, it is
commonly solved by alternately fixing one and minimizing over the other. 3
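
To make the sparse coding half of this alternating scheme concrete, here is a minimal Python sketch for a single patch with the dictionary held fixed. The text does not specify a solver, so the projected proximal-gradient iteration used here is an assumption chosen for brevity, and the names `nonneg_sparse_code`, `step` and `iters` are hypothetical. The dictionary update step is omitted.

```python
def nonneg_sparse_code(x, D, lam, step, iters=200):
    """Approximately solve  min_{a >= 0} 0.5 * ||x - D a||^2 + lam * sum(a).

    x: list of m numbers (one patch); D: list of m rows with k entries each.
    step: gradient step size; it should not exceed 1 / L, where L is the
          largest eigenvalue of D^T D (caller's responsibility).
    Uses a projected proximal-gradient iteration (an assumed solver).
    """
    m, k = len(D), len(D[0])
    a = [0.0] * k
    for _ in range(iters):
        # residual = D a - x
        residual = [sum(D[r][c] * a[c] for c in range(k)) - x[r]
                    for r in range(m)]
        # gradient of the quadratic term: D^T (D a - x)
        grad = [sum(D[r][c] * residual[r] for r in range(m))
                for c in range(k)]
        # gradient step, l1 shrinkage by step * lam, projection onto a >= 0
        a = [max(0.0, a[c] - step * (grad[c] + lam)) for c in range(k)]
    return a
```

Because the codes are constrained to be non-negative, the $\ell_1$ penalty reduces to a plain sum, so the shrinkage and the projection collapse into a single `max(0, ...)`.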

We will show next how to use these learned dictionaries for comparing simultane- 
ously performed actions on a single time interval (Section 3.1), and to detect action 
changes of the individuals along the temporal direction (Section 3.2), with a special 
case when P = 2 in Section 3.3. Finally, a spatio-temporal joint grouping for a long 
video is presented in Section 3.4. The algorithm's pipeline is outlined in Fig. 2. 



[Figure 2: pipeline diagram; recoverable stage labels: Dictionary Learning, Temporal Grouping, Spatio-temporal Grouping.]

Fig 2: Algorithmic pipeline of the proposed method. The first three stages (keypoints
extraction, feature extraction and dictionary learning) are common to all presented
analysis tools, while specific techniques at the pipeline's end help answer different
action-grouping questions. Although we propose tools for solving all the different
stages in this pipeline, the core contribution of this work is in the modeling of
actions via dictionary learning (the corresponding stage is denoted by a rectangle). This
allows the use of very simple techniques in the previous stages and much flexibility in the
subsequent ones.



3.1. Comparing simultaneously performed actions 

In a standard supervised classification scenario, subdictionaries $D^j$ are learned for
each human action, and are concatenated together to form a global dictionary. Then,
new unlabeled data from human actions are represented by this global dictionary and
are classified into the class whose corresponding subdictionary plays the most
significant role in the reconstruction (Castrodad and Sapiro, 2012). In our case, we do not
have labeled data for learning and classification. During reconstruction, each person
will obviously tend to prefer his/her own subdictionary from the global dictionary (since
the subdictionary is learned from the very same data), consequently inducing poor
discrimination power. To handle this difficulty, a "leave-one-out" strategy is
proposed: each individual action $j$ is represented in a global dictionary that excludes its
corresponding subdictionary $D^j$.

3 Recent developments, e.g., Xiang et al (2011); Gregor and LeCun (2010); Mairal et al (2010); Bronstein
et al (2012), have shown how to perform dictionary learning and sparse coding very fast, rendering the
proposed framework very efficient.

Let us assume that for each person $j \in [1, P]$, we have learned a dictionary
$D^j \in \mathbb{R}^{m \times k_j}$ using the patches $X^j$. We concatenate the $P - 1$ dictionaries
$\{D^i\}_{i = 1, \ldots, P, \, i \neq j}$ (that
is, without including $D^j$), to build the dictionary $\bar{D}^j \in \mathbb{R}^{m \times k}$,
$k = \sum_{i = 1, \, i \neq j}^{P} k_i$. To test the
similarity between the action performed by the $j$-th person and those performed by the
rest of the group, we solve

$$\min_{A^j \succeq 0} \; \frac{1}{2} \| X^j - \bar{D}^j A^j \|_F^2 + \lambda \| A^j \|_{1,1}. \qquad (2)$$

The computed sparse-codes matrix $A^j$ is the concatenation of the sparse code blocks
$\{A^{j,i}\}_{i = 1, \ldots, P, \, i \neq j}$ such that

$$\bar{D}^j A^j = [D^1, \ldots, D^{j-1}, D^{j+1}, \ldots, D^P] \, \big[ (A^{j,1})^T, \ldots, (A^{j,j-1})^T, (A^{j,j+1})^T, \ldots, (A^{j,P})^T \big]^T$$
$$= D^1 A^{j,1} + \cdots + D^{j-1} A^{j,j-1} + D^{j+1} A^{j,j+1} + \cdots + D^P A^{j,P}.$$

We use $\|A^{j,i}\|_{1,1}$ to encode the level of similarity between the action corresponding to
the $j$-th person and the action corresponding to the $i$-th person, $\forall i \neq j$. Let us motivate
this choice with the following example. If two persons, $j$ and $i$, are performing similar
actions and person $i'$ is performing a different action, when trying to represent $X^j$
with the dictionary $\bar{D}^j$, a larger $\ell_1$ energy (activation) is expected from the block $A^{j,i}$
(corresponding to $D^i$) than from that of $A^{j,i'}$ (corresponding to $D^{i'}$). We then define the
action-similarity matrix $S \in \mathbb{R}^{P \times P}$, whose entries $S_{ij}$ are defined as



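$$S_{ij} = \min \left( \frac{\|A^{i,j}\|_{1,1}}{\|A^{i}\|_{1,1}}, \; \frac{\|A^{j,i}\|_{1,1}}{\|A^{j}\|_{1,1}} \right). \qquad (3)$$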



The minimum is used to enforce reciprocal action similarity, and the normalization 
ensures that comparisons between all individual actions are fair. 
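
The construction of the similarity matrix can be sketched in Python as follows. The sketch assumes, following the description above, that each entry is the minimum of the two block energies after each person's energies are normalized by their total; the input layout and the name `similarity_matrix` are hypothetical, and the diagonal is simply left at zero.

```python
def similarity_matrix(block_energy):
    """Build the action-similarity matrix from l1 block energies.

    block_energy[j][i] holds the l1 energy of the sparse-code block that
    person i's dictionary receives when coding person j's patches (i != j).
    Diagonal entries of the input are ignored.
    Assumption: each row is normalized by its off-diagonal total, and the
    minimum over both directions enforces reciprocity.
    """
    P = len(block_energy)
    totals = [sum(block_energy[j][i] for i in range(P) if i != j)
              for j in range(P)]

    def share(j, i):
        # fraction of person j's total activation that falls on dictionary i
        return block_energy[j][i] / totals[j] if totals[j] > 0 else 0.0

    S = [[0.0] * P for _ in range(P)]
    for j in range(P):
        for i in range(P):
            if i != j:
                S[j][i] = min(share(j, i), share(i, j))
    return S
```

Taking the minimum means that a pair is considered similar only if each person's patches draw heavily on the other person's dictionary.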

We then consider the matrix $S$ as the affinity matrix of an undirected weighted graph
$G$. Although numerous techniques can be used to partition $G$, in this work we use
a simple approach that proved successful in our experiments (recall that the expected
number of persons $P$ is small in a group, in contrast to a crowd, so clustering techniques,
which rely on statistical properties of the graph, are neither needed nor appropriate).
We simply remove the edges of $G$ that correspond to entries $S_{ij}$ such that $S_{ij} < \tau$, for
a given threshold $\tau$. For a properly chosen threshold, this edge removal will cause $G$
to split into several connected components, and we consider each one as a group of
persons performing the same action. In an ideal scenario where all actions are equal,
the similarity scores $\|A^{j,i}\|_{1,1}$ ($i = 1, \ldots, P$, $i \neq j$) will also be similar. Since in Equation
(3) we normalize them, setting the threshold to $1/(P - 1)$ seems to be a natural choice.
However, in practice, the distribution of these coefficients is not strictly uniform. For
example, in a video with four skeletons dancing in a synchronous fashion (see Fig. 3),
the similarity scores in the resulting affinity matrix still show slight variations. We thus
set $\tau = r/(P - 1)$, where $r \in [0, 1]$ is a relaxation constant (in our experiments we found that
$r = 0.9$ was sufficient to cope with this nonuniformity effect).
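
The thresholding and connected-component step can be sketched in Python as follows. This is a minimal illustration that assumes a symmetric affinity matrix given as nested lists and the relaxed threshold r / (P - 1); the function name `group_actions` and the depth-first traversal are choices made for this example.

```python
def group_actions(S, r=0.9):
    """Group persons by thresholding the affinity matrix S.

    Edges with S[j][i] below tau = r / (P - 1) are removed, and each
    connected component of the remaining graph is one action group.
    Returns one group label per person. Requires P >= 2.
    """
    P = len(S)
    tau = r / (P - 1)
    labels = [-1] * P
    group = 0
    for start in range(P):
        if labels[start] != -1:
            continue
        labels[start] = group
        stack = [start]
        while stack:  # depth-first traversal of the surviving edges
            j = stack.pop()
            for i in range(P):
                if i != j and labels[i] == -1 and S[j][i] >= tau:
                    labels[i] = group
                    stack.append(i)
        group += 1
    return labels
```

For three persons the threshold is 0.45, so an affinity of 0.8 between the first two keeps them in one group while affinities of 0.2 and 0.1 isolate the third.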

3.2. Temporal analysis: Who changed action ? 

In the previous section, we presented the modeling and grouping scheme for a fixed
time interval (a given video segment). The matrix $S$ provides sufficient information to
determine if there are different actions occurring during an interval, and to determine
which individual/s are performing them. 4 Suppose that on a given interval $t - 1$, all $P$
individuals are performing the same action; then, on the next time interval $t$, the first
$P - 1$ individuals change action while the $P$-th individual keeps doing the same. From
the affinity matrix $S$ there is no way of determining whether the first $P - 1$ persons changed
while the $P$-th person kept doing the same, or vice versa. An additional step is thus
necessary in order to follow the individuals' action evolution in the group. A possible
solution for this problem, at a small additional computational cost, is as follows.
Let the minimized energy

$$\mathcal{R}^*(X_t^j, D_t^j) = \min_{A^j \succeq 0} \; \frac{1}{2} \| X_t^j - D_t^j A^j \|_F^2 + \lambda \| A^j \|_{1,1} \qquad (4)$$

be the $j$-th individual's $\ell_{2,1}$ representation error with his/her own dictionary at time $t$.
Then, we measure the evolution of the reconstruction error per individual as

$$E^j_{t-1,t} = \left| \mathcal{R}^*(X^j_{t-1}, D^j_t) + \mathcal{R}^*(X^j_t, D^j_{t-1}) - \mathcal{R}^*(X^j_{t-1}, D^j_{t-1}) - \mathcal{R}^*(X^j_t, D^j_t) \right|, \qquad (5)$$

and

$$E_{t-1,t} = \frac{1}{C} \left[ E^1_{t-1,t}, \ldots, E^P_{t-1,t} \right], \qquad (6)$$

where $C = \sum_{j=1}^{P} E^j_{t-1,t}$ is a normalization constant. $E_{t-1,t}$ captures the action changes
in a per-person manner: a value of $E^j_{t-1,t}/C$ close to 1 implies that the representation
for the $j$-th individual has changed drastically, while a value close to 0 implies the individual's
action remained exactly the same. In the scenario where nobody changes action, all
$E^j_{t-1,t}$, $\forall j \in [1, P]$, will be similar. We can apply a similar threshold $\mu = r/P$ to detect
the actions' time changes (note that we now have $P$ persons instead of $P - 1$).
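
The per-person change scores can be sketched in Python as follows, assuming the four minimized energies per individual (two cross terms and two self terms) have already been computed by a sparse coding routine. The input layout and the name `change_scores` are hypothetical. Since the text does not spell out the final decision rule, the sketch only returns the normalized scores together with the threshold r / P.

```python
def change_scores(cross_energy, self_energy, r=0.9):
    """Normalized evolution of the reconstruction error, one value per person.

    cross_energy[j] = (energy of X_{t-1} with D_t, energy of X_t with D_{t-1})
    self_energy[j]  = (energy of X_{t-1} with D_{t-1}, energy of X_t with D_t)
    Returns (scores, mu), where scores sum to 1 and mu = r / P.
    """
    raw = [abs(c[0] + c[1] - s[0] - s[1])
           for c, s in zip(cross_energy, self_energy)]
    total = sum(raw)  # normalization constant C
    scores = [e / total if total > 0 else 0.0 for e in raw]
    mu = r / len(raw)
    return scores, mu
```

In a toy case with three persons where only the first one's cross energies grow markedly, the first score dominates (0.8 against 0.1 for the others, with a threshold of 0.3).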

4 The same framework can be applied if we have a single individual and just want to know if all the
activities he/she is performing in P > 2 time intervals are the same or not, each time interval taking the place
of an "individual."

3.3. Special case: P = 2

If there are only two individuals in the video ($P = 2$), the grouping strategies from
Sections 3.1 and 3.2 become ambiguous. The similarity matrix $S$ would have all entries
equal to 1 (always one group), and there would be no clear interpretation of the values
in $E_{t-1,t}$. Therefore, for this particular case where $P = 2$, we define the one time interval
measure (at time $t$) as

$$E^{i,j}_t = \max \left( \frac{\left| \mathcal{R}^*(X^i_t, D^j_t) - \mathcal{R}^*(X^i_t, D^i_t) \right|}{\left| \mathcal{R}^*(X^i_t, D^i_t) \right|}, \; \frac{\left| \mathcal{R}^*(X^j_t, D^i_t) - \mathcal{R}^*(X^j_t, D^j_t) \right|}{\left| \mathcal{R}^*(X^j_t, D^j_t) \right|} \right), \qquad (7)$$

where $i$ and $j$ denote each subject. Also, we let the evolution of the reconstruction error
from $t - 1$ to $t$ of each individual $i \in [1, 2]$ be



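$$E^i_{t-1,t} = \max \left( \frac{\left| \mathcal{R}^*(X^i_{t-1}, D^i_t) - \mathcal{R}^*(X^i_{t-1}, D^i_{t-1}) \right|}{\left| \mathcal{R}^*(X^i_{t-1}, D^i_{t-1}) \right|}, \; \frac{\left| \mathcal{R}^*(X^i_t, D^i_{t-1}) - \mathcal{R}^*(X^i_t, D^i_t) \right|}{\left| \mathcal{R}^*(X^i_t, D^i_t) \right|} \right). \qquad (8)$$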

Finally, we use the threshold $\mu = r/2$, where a value greater than $\mu$ implies that the
subject is performing a different action. Note that these formulations do not replace the
algorithms in Sections 3.1 and 3.2, which are more general and robust for treating the cases
where $P > 2$.
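
The two-entity measure and its threshold can be sketched in Python as follows, assuming the relative-increase form described in this section: for each entity, the energy obtained with the other entity's dictionary is compared with the energy obtained with its own. The function names and argument order are hypothetical, and the energies are assumed to be precomputed and nonzero.

```python
def pair_measure(own_i, cross_i, own_j, cross_j):
    """Largest relative energy increase when the dictionaries are swapped.

    own_i:   energy of entity i's patches coded with its own dictionary.
    cross_i: energy of entity i's patches coded with entity j's dictionary.
    own_j, cross_j: the same quantities for entity j.
    """
    return max(abs(cross_i - own_i) / abs(own_i),
               abs(cross_j - own_j) / abs(own_j))


def different_actions(measure, r=0.9):
    """Apply the threshold mu = r / 2; a larger value means a different action."""
    return measure > r / 2
```

The same function covers both uses in this section: the two entities can be two persons in the same interval, or one person in two consecutive intervals.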



3.4. Joint spatio-temporal grouping 

The above discussion holds for a short-time video containing a few individuals.
Clustering techniques, which rely on statistical properties of the data, are neither needed
nor appropriate to handle only a few data points. Simpler techniques were thus needed.

Now, for long video sequences, action analysis with the previously described simple 
tools becomes troublesome. Luckily, clustering methods are ideal for this scenario. 
Hence, we use them to do joint spatio-temporal action grouping (here, by spatial we 
mean across different individuals). We consider each individual in a given time interval 
as a separate entity (an individual in two different time intervals is thus considered as 
two individuals). Dictionaries are learned for each spatio-temporal individual and an 
affinity matrix is built by comparing them in a pairwise manner using equations (7) or 
(8) (notice that in this setting, the two equations become equivalent). We simply apply 
to this affinity matrix a non-parametric spectral clustering method that automatically 
decides the number of groups (Zelnik-Manor and Perona, 2004). 
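
The bookkeeping behind this step can be sketched in Python as follows. The sketch only enumerates the spatio-temporal entities and fills a symmetric pairwise matrix. The callable `measure` is a hypothetical stand-in for the comparison in Equations (7) or (8), the toy measure below is invented for illustration, and the conversion to an affinity and the spectral clustering step are deliberately left out.

```python
from itertools import product
from typing import Callable, List, Tuple

Entity = Tuple[int, int]  # (individual index, time-interval index)


def spatio_temporal_entities(num_individuals: int, num_intervals: int) -> List[Entity]:
    """Treat every (individual, interval) pair as a separate entity."""
    return list(product(range(num_individuals), range(num_intervals)))


def pairwise_matrix(entities: List[Entity],
                    measure: Callable[[Entity, Entity], float]) -> List[List[float]]:
    """Fill a symmetric matrix by evaluating `measure` once per unordered pair.

    The diagonal is left at 0.0 as a placeholder.
    """
    n = len(entities)
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            value = measure(entities[a], entities[b])
            matrix[a][b] = matrix[b][a] = value
    return matrix


# Three individuals observed over five one-second intervals give 15 entities.
entities = spatio_temporal_entities(3, 5)


def toy_measure(p: Entity, q: Entity) -> float:
    """Hypothetical stand-in: entities differ when their intervals differ."""
    return float(p[1] != q[1])


matrix = pairwise_matrix(entities, toy_measure)
print(len(entities), len(matrix), len(matrix[0]))
```

With three individuals and 40 intervals the same enumeration yields 120 entities, which matches the 120 x 120 matrix size reported for the Fitness experiment.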



4. Experimental results 

In all the reported experiments, we used $n = \min(\min_i(n_i), 15{,}000)$ overlapping temporal
gradient patches of size $m = 15 \times 15 \times 7$ for learning a dictionary per individual.
The tracked segmentation mask for each individual (Papadakis and Bugeau, 2011) is dilated
to ensure that the individual is better covered (sometimes the segmentation is not
accurate for the limbs under fast motions). Only the features belonging to the tracked
individuals are used for action modeling and dictionary learning. The dictionary size
was fixed to $k_j = 32$ for all $j$, which is very small compared to the patch dimension
(undercomplete). The duration of the tested video segments (short-time intervals) is
one second for action grouping per time interval, and two seconds for temporal analysis.
Longer videos (from 5 seconds to 320 seconds) were used for joint spatio-temporal
grouping. All the tested videos are publicly available and summarized in Table 1.⁵

Table 1
Description summary of the videos used in the experiments.

Video              fps  Figures  Description
-----------------  ---  -------  ------------------------------------------------------------------
Skeleton^a         30   3        Four skeletons dancing in a similar manner.
Long jump^b        25   4, 14    Three athletes in a long jump competition.
Gym^c              30   5        Three persons in a gym class.
Kids^d             30   6        Three kids dancing in a TV show.
Crossing^e,f       30   8        Two pedestrians crossing the street and the other one waiting.
Jogging^g,f        30   9        Six persons jogging in a park.
Dancing^h,f        30   10       Five persons rehearsing a dance act.
Singing-dancing^i  30   7        A singer and four dancers performing in a theater.
Tango^j            30   11       Three couples dancing Tango.
Mimicking^k        30   13       Gene Kelly in a self-mimicking dancing act.
Fitness^l          30   12, 15   Three persons in a fitness class.
Outdoor^m          30   16       Action classification video used by Zelnik-Manor and Irani (2006).

a http://www.youtube.com/watch?v=h03QBNVwX8Q
b http://www.youtube.com/watch?v=bia-x_linh4
c http://www.openu.ac.il/home/hassner/data/ASLAN/ASLAN.html
d http://www.youtube.com/watch?v=N0wPQpB4eMk
e http://www.eecs.umich.edu/vision/activity-dataset.html
f Bounding boxes surrounding each person are provided, and they were used instead of the
  tracking/segmentation approach.
g http://www.eecs.umich.edu/vision/activity-dataset.html
h http://www.eecs.umich.edu/vision/activity-dataset.html
i http://www.youtube.com/watch?v=R9msiIqkI34
j http://www.youtube.com/watch?v=IkFjg7m-jzs
k http://www.youtube.com/watch?v=_DC6heLMqJs
l http://www.youtube.com/watch?v=BrgPzpOGBcw
m http://www.wisdom.weizmann.ac.il/mathusers/vision/VideoAnalysis/Demos/EventDetection/OutdoorClusterFull.mpg

We provide a rough running-time estimate in order to show the efficiency of the
proposed framework. Our non-optimized Matlab code for feature extraction takes less
than 10 seconds to process 30 frames of a standard VGA video (640 × 480 pixels)
containing five individuals, and about 3 seconds for 30 frames of a lower-resolution
video (480 × 320 pixels) containing three individuals. As for the sparse modeling,
we used the SPAMS toolbox,⁶ which takes approximately 1 second to perform dictionary
learning and sparse coding of 15,000 samples. Notice that code optimization and
parallelization would significantly boost the performance, potentially yielding a
real-time process.

4.1. Action grouping per time interval 

We now test the classification performance of the framework described in Section 3.1
for grouping actions per time interval. The cartoon skeletons dancing in the Skeleton
video, as shown in Fig. 3, were segmented and tracked manually to illustrate that
individual actions have intrinsic variability, even when the actions performed are the same,
justifying the relaxation coefficient $r$. Notice that this effect is not a by-product of the
tracking/segmentation procedure, since it is done manually in this example.

5 We only present a few frames for each video in the figures; please see the supplementary
material and the mentioned links for the complete videos.

6 http://spams-devel.gforge.inria.fr/




Fig 3: Sample frames from the Skeleton video. Left: Four skeletons dancing in the same
manner were manually segmented and tracked during a single one-second interval.
Center: Affinity matrix before binarization. The slight differences in the entries are
due to the intrinsic variability of the action itself. Right: Binarized affinity matrix after
applying the threshold $\tau = 0.9/3 = 0.3$. The entries show slight variation
but are all larger than 0.3, so they are all binarized to 1, and thus the four skeletons are
grouped together.



We then analyzed the Long jump video, shown in Fig. 4, where three persons are 
performing two actions: running and long-jumping. We tested every possible configu- 
ration of three video individuals (with corresponding time segments cropped from the 
original video), showing that we are doing action classification and not person classifi- 
cation. These results are summarized in Table 2, where we obtained one single grouping 
error (see the third configuration). 

Table 2
Three people running (R) and then long-jumping (LJ). The grouping decision is represented with
letters A and B, with only one grouping error on the third configuration (cells are colored to
facilitate the comparison between the results and the ground truth; matching colors mean a
correct result). See Fig. 4 for sample frames from this video.

                        Action grouping (configuration)
               Persons  1    2    3    4    5    6    7    8
-------------  -------  ---  ---  ---  ---  ---  ---  ---  ---
Ground truth   I        R    LJ   LJ   R    R    R    LJ   LJ
               II       R    LJ   R    LJ   R    LJ   R    LJ
               III      R    LJ   R    R    LJ   LJ   LJ   R

Result         I        A    A    A    A    A    A    A    A
               II       A    A    A    B    A    B    B    A
               III      A    A    A    A    B    B    A    B


Fig 4: Sample frames from the Long jump video. On each row, an athlete is running and
then long-jumping. Colored segmentation masks are displayed. (This is a color figure.)

The test for the Gym video, shown in Fig. 5, consists of three persons performing
two actions: punching and dancing. The results are shown in Table 3. In this test, we
again obtained only one grouping error.




Fig 5: Sample frames from the Gym video. On the first row, the three individuals are 
punching; on the second row, they are dancing. The segmentation masks for the differ- 
ent individuals appear in different colors. (This is a color figure.) 



To further validate that we are doing action classification (not just classifying the 
person itself), and that we can treat more imbalanced cases (most individuals perform- 
ing an action), we also conducted a 'cloning' test, using the Gym video (Fig. 5). In this 
scenario, we added a fourth person by artificially replicating one of the original three, 
but performing (in a different video segment) an action different than the one of the 
original time interval. The results are shown in Table 4. A correct grouping result was 
attained, confirming that the proposed method only perceives actions, and is robust to 
differences in human appearance. 

Fig 6: Sample frames from the Kids video. First row: dancing. Second row: jumping.
The segmentation masks for the different individuals appear in different colors. (This
is a color figure.)

Table 3
Analysis of the Gym video. Three persons are punching (P) or dancing (D) (cells are colored to
facilitate the comparison). The grouping decision is shown with values A or B, with only one
grouping error. See Fig. 5 for some typical frames.

                        Action grouping (configuration)
               Persons  1    2    3    4    5    6    7    8
-------------  -------  ---  ---  ---  ---  ---  ---  ---  ---
Ground truth   I        P    D    P    D    D    D    P    P
               II       P    D    D    P    D    P    D    P
               III      P    D    D    D    P    P    P    D

Result         I        A    A    A    A    A    A    A    A
               II       A    A    B    B    A    A    B    A
               III      A    A    B    A    B    A    A    B

In a more imbalanced scenario, we analyzed the Singing-dancing video, where four
individuals are dancing while the remaining one is singing (see Fig. 7). From the affinity
matrix, we observe that the row and column corresponding to the second (singing)
person have smaller values, binarized to zero after applying the thresholding operation.
The threshold in this case is $\tau = 0.9/4 = 0.225$, slightly larger than the value 0.21 in
entries (1,4) and (4,1), which should have been binarized to 1. Since the persons are grouped
as connected components, we still obtain the correct groups. More imbalanced tests for
the Crossing, Jogging, and Dancing videos (shown in Figs. 8, 9, and 10) also give
correct grouping results.
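
The binarization and connected-components step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the threshold form r/(P - 1) is read off the values quoted in the figure captions (0.9/3 for four persons, 0.9/4 for five, 0.9/5 for six), the function name is hypothetical, and the 5 x 5 matrix is made up, except that entry (1,4) is set to 0.21 to mimic the Singing-dancing case. Counting a link when either direction passes the threshold is also an assumption.

```python
from typing import List


def group_by_connected_components(affinity: List[List[float]], r: float = 0.9) -> List[int]:
    """Binarize the affinity matrix at r / (P - 1) and label connected components."""
    p = len(affinity)
    if p < 2:
        return [0] * p
    threshold = r / (p - 1)
    # Assumption: a link exists if either direction exceeds the threshold.
    adjacent = [[i != j and max(affinity[i][j], affinity[j][i]) > threshold
                 for j in range(p)] for i in range(p)]
    labels = [-1] * p
    current = 0
    for start in range(p):
        if labels[start] != -1:
            continue
        # Depth-first traversal of everything reachable from `start`.
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in range(p):
                if adjacent[node][neighbor] and labels[neighbor] == -1:
                    labels[neighbor] = current
                    stack.append(neighbor)
        current += 1
    return labels


# Hypothetical affinities: person 2 (index 1) is weakly related to everyone,
# and the link between persons 1 and 4 (0.21) falls just below 0.9 / 4 = 0.225.
affinity = [
    [1.00, 0.05, 0.30, 0.21, 0.35],
    [0.05, 1.00, 0.04, 0.06, 0.05],
    [0.30, 0.04, 1.00, 0.28, 0.31],
    [0.21, 0.06, 0.28, 1.00, 0.27],
    [0.35, 0.05, 0.31, 0.27, 1.00],
]
print(group_by_connected_components(affinity))
```

Persons 1 and 4 end up in the same group even though their direct link is cut, because both remain linked to persons 3 and 5. This is the reason a single entry below the threshold does not break the grouping.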

Two additional tests were performed using the Tango video, where there are three
couples dancing Tango (Fig. 11). Instead of treating each individual separately, we
considered each couple as a single entity and applied the proposed method. The returned
affinity matrix shows that the proposed method correctly groups the three couples as
one group (if they are performing the same activity) or two groups (if they are
performing different activities).⁷

7 A video with the results of this experiment is available at: http://youtu.be/WwAjSU_RuXA.
Bounding boxes of different color indicate different actions.

Table 4
Analysis of the Gym video. Three persons are punching (P) or dancing (D), and a 'clone' is added.
The second column denotes the person's index. The fourth person, that is the clone, is the same
as one of the first three, but is doing something different. For example, I-D means that person I
was cloned and the clone is dancing. The (perfect) grouping decision is shown inside the table
(cells are colored to facilitate the comparison). See Fig. 5 for some typical frames.

                        Action grouping (configuration)
               Persons  1     2     3     4      5     6     7
-------------  -------  ----  ----  ----  -----  ----  ----  -----
Ground truth   I        P     P     P     P      D     D     D
               II       P     P     P     P      D     D     D
               III      D     P     P     P      D     D     D
               'Clone'  II-D  I-D   II-D  III-D  I-P   II-P  III-P

Result         I        A     A     A     A      A     A     A
               II       A     A     A     A      A     A     A
               III      B     A     A     A      A     A     A
               'Clone'  B     B     B     B      B     B     B

Let us now turn our attention to the employed features to point out the effectiveness
of our simple approach. We conducted experiments with 24 configurations of the video
segments from the Long jump, the Gym, and the Kids videos, see Figs. 4, 5, and 6 (8
configurations from each of them, similar to the configurations in Tables 2 and 3). We
compared our simple feature (temporal gradient detector) against several feature
detectors (the cuboid (Dollár et al., 2005) and Harris 2D (Harris and Stephens, 1988); see
also the local motion patterns (LMP) by Guha and Ward (2012)) and descriptors (our
3D temporal gradient patches, HOG3D (Kläser et al., 2008), and the cuboid (Dollár
et al., 2005)). The Harris 2D detector detects spatially distinctive corners in each image
frame, while the cuboid detector relies on applying separable linear filters, which
produce high responses at local image intensities containing periodic frequency
components. It also responds to any region with spatially distinct characteristics undergoing
a complex motion (Dollár et al., 2005). It is important to mention that both detectors
produce fewer feature points than the temporal gradient detector. As for the descriptors,
they all produce vectors with comparable dimensionalities: $m = 1{,}575$ for the temporal
gradient patch, $m = 1{,}440$ for the cuboid descriptor, and $m = 1{,}000$ for HOG3D.
The results are presented in Table 5, where the best grouping performance is obtained
with the proposed detector/descriptor based on the temporal gradient. According to the
evaluation by Wang et al. (2009), HOG3D performs well in combination with dense
sampling, which can capture some context information. But this is not appropriate in
our unsupervised grouping framework for single videos. The cuboid detector gives
good results in combination with the temporal gradient descriptor and HOG3D, but the
cuboid descriptor seems to underperform. The Harris 2D detector only produces 2D
feature points, which do not necessarily undergo significant motion over time. Even
though we employ the temporal gradient here, we are not claiming that the other features
are intrinsically or generally bad, since they work extremely well in supervised
scenarios. Nevertheless, for the problem at hand, where data is scarce, our simple feature
performs better.

4.2. Temporal analysis experiments 

To test the proposed strategy for dealing with temporal action changes, we processed 
several video configurations with two consecutive time intervals. The main goal is to 
identify the individuals who changed actions. 

Z. TANG et al./Are You Imitating Me?

Fig 7: Sample frames from the Singing-dancing video. Left: Five persons, a singer
and four dancers, were tracked/segmented during a one-second interval (the masks are
displayed in colors on the bottom left, the singer appearing in green). Center: Affinity
matrix. Right: Binarized affinity matrix after applying the threshold $\tau = 0.9/4 = 0.225$.
Note that the values in the entries of the second row and the second column (except the
diagonal entries) are small, hence binarized to zero. This implies that the second person
is doing a different action than the group. Entries (1,4) and (4,1) fail to be binarized
to 1, but this does not affect the grouping, since the persons are grouped as connected
components. Two groups are correctly detected, the four dancers and the singer. (This
is a color figure.)

Table 6 summarizes the results of applying the method described in Section 3.2 to
the Long jump video (Fig. 4). Correct results were obtained when analyzing a video in which
only a subset of the involved persons change their actions (experiments 3 and 4). In the
case where all the persons change action simultaneously or keep doing the same action
(experiments 1 and 2), we observe incorrect results. The proposed framework for
representing actions is not the source of this issue. It is a consequence of the normalization
in Equation (6), which compares individuals who are either all changing their action or
all continuing their previous action and hence provides no discriminative power in this
case.
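
The action-change rule can be replayed on the values reported in Table 6 with a short Python sketch. It assumes the evolution values are already normalized across persons (the Table 6 entries of each experiment sum to one) and that r = 0.9, which gives the threshold 0.3 quoted in the table caption for P = 3. The function name is hypothetical.

```python
from typing import Dict, List


def detect_changes(evolution: List[float], r: float = 0.9) -> List[bool]:
    """Flag individuals whose normalized evolution value exceeds mu = r / P."""
    mu = r / len(evolution)
    return [value > mu for value in evolution]


# Normalized evolution values from Table 6, listed for persons A, B, C.
table6: Dict[int, List[float]] = {
    1: [0.378, 0.336, 0.286],  # ground truth: nobody changes action
    2: [0.309, 0.353, 0.338],  # ground truth: everybody changes action
    3: [0.133, 0.437, 0.430],  # ground truth: B and C change action
    4: [0.202, 0.142, 0.656],  # ground truth: only C changes action
}

for experiment, values in table6.items():
    print(experiment, detect_changes(values))
```

In experiments 1 and 2 all three values sit close to 1/P, so whether they fall above or below the threshold carries little information. This is the loss of discriminative power attributed to the normalization.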

Although it is not easy to extract a general rule for every possible scenario, the
vector $[E^i_{t,t+1}, E^i_{t+1,t+2}, \ldots]$ (see Equation (6), p. 10) provides useful
information about how an individual's actions evolve over time. An example is shown in Fig. 12,
where we build this vector for one individual in the Fitness video (see Fig. 15), using
seven consecutive one-second time intervals (from $t$ to $t + 6$). During this time lapse,
the individual changes action once. More complex rules can be derived from
this readily available information. The action-change rule provided in Section 3.2 will
nonetheless already be useful in many cases.

Finally, we present an example using the Mimicking video for the special case ($P = 2$)
described in Section 3.3. It consists of two one-second intervals ($t - 1$ and $t$) from a
comedy show, where two dancers ($i$ and $j$) are mimicking each other (see Fig. 13). Using
Equations (7) and (8), and a threshold $\mu = r/2$, we obtain $E^{ij}_{t-1} = 2.29$,
$E^{ij}_t = 1.07$, $E^i_{t-1,t} = 1.62$, and $E^j_{t-1,t} = 1.64$. These results correctly
imply that the dancers were performing different actions on each of the two seconds, and that
both dancers went through an action change from $t - 1$ to $t$.

Fig 8: Left: A sample frame of the Crossing video. Two pedestrians at left are crossing
the street while the one at right is waiting. The provided bounding boxes were used
instead of running the tracking/segmentation algorithm. Center: Affinity matrix. Right:
Binarized affinity matrix after applying the threshold $\tau = 0.9/2 = 0.45$. Two groups
are correctly detected. (This is a color figure.)

4.3. Joint spatio-temporal grouping

We further analyzed three videos, i.e., the Long jump, Fitness, and Outdoor videos, 
in which human actions are jointly grouped in time and space (recall that by space 
we mean across individuals), applying the nonparametric spectral clustering algorithm 
by Zelnik-Manor and Perona (2004) to the pairwise affinity matrix described in Sec- 
tion 3.4. 

The first experiment, using 5 seconds from the Long jump video, is shown in Fig. 14.
The 3 individuals are first running and then long-jumping. We thus consider that there are
$15 = 5 \cdot 3$ individuals in the video. The clustering algorithm on this $15 \times 15$
affinity matrix gives 2 correct clusters, even though the three athletes have different
appearance.⁹

The second experiment was conducted on a 40-second sequence from the Fitness
video, in which three individuals are doing gym exercises (Fig. 15). Notice that, even
though their actions are synchronized, we do not provide this information a priori to
the algorithm. The clustering algorithm returned 5 clusters from the $120 \times 120$ affinity
matrix, whereas 4 should have been found.¹⁰ There is an over-splitting for the individual in
the middle from frame 300 to frame 690, meaning that in this case either auto-similarity
was captured in excess or the action of this person is actually different (such granularity
in the action can actually be observed by carefully watching the video). There are also
some incorrect clustering results in the transition period between two actions due to
temporal mixing effects (our intervals are arbitrarily fixed and may incorrectly divide
the actions during the switch). Note also that the ground truth was manually built from
visual observation and is also fixed to having hard action transitions. Considering
overlapping or shorter segments will alleviate this issue if additional temporal accuracy is
needed.

8 Videos with the results of this experiment are available at: http://youtu.be/I922vARiGko,
http://youtu.be/skfJSs5-ftl and http://youtu.be/S04x8YeSbao. Bounding boxes of different color
indicate different actions.

9 A video with the results of this experiment is available at: http://youtu.be/9KQa9mFXBIk.

10 A video with the results of this experiment is available at: http://youtu.be/5moGG2e4PXc.

Fig 9: Left: A sample frame of the Jogging video. The provided bounding boxes were
used instead of running the tracking/segmentation algorithm. Center: Affinity matrix.
Right: Binarized affinity matrix after applying the threshold $\tau = 0.9/5 = 0.18$. Note
that some entries are a little smaller than the threshold 0.18, thus binarized to 0. But
the grouping result is still correct since persons are grouped as connected components.
One single group is correctly detected. (This is a color figure.)

Table 5
The grouping classification accuracy on 24 example configurations from the Long jump, the Gym,
and the Kids videos, using several detector/descriptor combinations.

                     Descriptor
Detector             Temporal gradient  HOG3D^a  Cuboid^b
-------------------  -----------------  -------  --------
Temporal gradient    87.5%              75.0%    ^d
Cuboid^b             79.1%              79.1%    66.7%
Harris 2D^c          75.0%              75.0%    ^d

a Kläser et al. (2008), implementation available at
  http://lear.inrialpes.fr/~klaeser/software_3d_video_descriptor.
b Dollár et al. (2005), implementation available at
  http://vision.ucsd.edu/~pdollar/files/code/cuboids/.
c Harris points are detected in each frame. Patches around these keypoints are used to
  construct the spatio-temporal descriptors. This is similar to the local motion patterns
  (LMP) proposed in Guha and Ward (2012).
d A separate implementation of the cuboid descriptor is not available.

In the third experiment we processed the Outdoor video (Fig. 16). The video contains
only one individual per frame, whose appearance changes (different individuals
appear over time with different clothing). This video exhibits very slow motions and, in
order to capture enough action information in the temporal direction, we first subsampled
the video by a factor of 2 in the temporal direction before applying the proposed
method. The clustering is consistent with visual observation.¹¹ We observe some
incorrect labels, again due to the fixed transition period between two actions, that is, one
time interval can contain two actions and the action type is not well defined in this
situation for the corresponding time segment.

11 A video with the results of this experiment is available at: http://youtu.be/l-X0-D9qRRg.

Fig 10: Left: A sample frame of the Dancing video. The provided bounding boxes were
used instead of running the tracking/segmentation algorithm. Center: Affinity matrix.
Right: Binarized affinity matrix after applying the threshold $\tau = 0.9/4 = 0.225$. Note
that some entries are a little smaller than the threshold 0.225, thus binarized to 0. But
the grouping result is still correct since persons are grouped as connected components.
One single group is correctly detected. (This is a color figure.)

Table 6
Temporal analysis of the Long jump video. Three persons in a race track on consecutive time
intervals. 'R' and 'LJ' denote running and long-jumping, respectively. A value above
$\mu = 0.3$ in the action evolution vector $E^j_{t-1,t}$ means that the person's action has
changed.

         Experiment 1       Experiment 2       Experiment 3       Experiment 4
Person   t-1  t    E        t-1  t    E        t-1  t    E        t-1  t    E
------   ---  ---  -----    ---  ---  -----    ---  ---  -----    ---  ---  -----
A        R    R    0.378    R    LJ   0.309    R    R    0.133    R    R    0.202
B        R    R    0.336    R    LJ   0.353    R    LJ   0.437    R    R    0.142
C        R    R    0.286    R    LJ   0.338    R    LJ   0.430    R    LJ   0.656

5. Concluding remarks 

We presented an unsupervised sparse modeling framework for action-based scene analysis
from a single video. We model each of the individual actions independently, via
sparse modeling techniques, and build a group affinity matrix. Applying relatively simple
rules based on representation changes, the proposed method can efficiently and
accurately tell whether there are different actions occurring on the same short-time
interval, and across different intervals, including detecting possible action changes by
any of the group members. In addition, we extended the method to handle longer motion
imagery sequences by applying standard spectral clustering techniques to a larger
spatio-temporal affinity matrix. We tested the performance of the framework with diverse
publicly available datasets, demonstrating its potential effectiveness for diverse
applications.

We also showed that a single, simple feature outperforms standard and more sophisticated
ones in such a scarce-data scenario. This indicates that further
research on good features for unsupervised action classification is much needed.

We are currently working on extending the model to handle interactions, that is,
meta-actions performed by several persons. Also, going from purely local features to
semi-local ones by modeling their interdependences might provide a way to capture
more complex action dynamics. Finally, action similarity is an intrinsically multiscale
issue (see the example in Fig. 15, where the middle lady is performing the same coarse
action but in a different fashion), therefore calling for the incorporation of such
concepts in action clustering and detection.

(b) Two different activities are correctly identified (the couples on the left and right are
grouped together, while the couple in the middle is isolated)

Fig 11: Left: Sample frames from the Tango video, where three couples dancing Tango
during a one-second interval were tracked/segmented (masks are displayed
in different colors on the bottom left). Each couple was treated as a single entity.
Center: Affinity matrix. Right: Binarized affinity matrix after applying the threshold
$\tau = 0.9/2 = 0.45$. (This is a color figure.)
[Plot: one value per pair of consecutive intervals, from (t, t+1) to (t+5, t+6).]

Fig 12: The vector $[E^i_{t,t+1}, \ldots, E^i_{t+5,t+6}]$ (see Equation (6), p. 10) over seven
consecutive time intervals for one individual in the Fitness video (see Fig. 15), during which
he/she performs two different actions. We can see that the values of $E^i_{t,t+1}$,
$E^i_{t+1,t+2}$, $E^i_{t+4,t+5}$, and $E^i_{t+5,t+6}$ are small, reflecting no change of action
in that interval. The other values ($E^i_{t+2,t+3}$ and $E^i_{t+3,t+4}$) are relatively big
because the individual is changing actions. The transition between actions is not
instantaneous, lasting for about two seconds.
Fig 13: On the first column, two frames from the Mimicking video. On the remaining
columns, the tracking/segmentation masks are displayed in colors. The two dancers are
correctly detected as performing different actions. (This is a color figure.)

Fig 14: 5 seconds from the Long jump video, where three athletes are running and
long-jumping, see sample frames in Fig. 4. Our method correctly identifies two actions.

[...] of four. Between frames 300 and 690, all three persons are doing the same coarse action
and this over-splitting can be explained by granular action variability, where the person
in the middle presents auto-similarity (she is somewhat more energetic than the others).

Fig 16: 320 seconds from the Outdoor video. In accordance with visual observation,
seven clusters are identified. There are a few clustering errors in the transition periods
between the actions due to the discrete temporal nature of the particular analysis here
exemplified.